1 Motivation

The hotel industry has always been a very complex sector which requires a continuous improvement of the management methods used. Since the pandemic crisis, many changes have impacted this sector, not to mention the exacerbated evolution of customer expectations which are becoming more and more demanding. Therefore, hotels need to adapt their operations to this challenging context.

Demand forecasting is a new method to mitigate the impact of these constraints. Due to its benefits (efficient planning, better decision making, pricing…) it is essential for hotels to apply it.

The main objective of this project is the development of a predictive model that allows hotels to anticipate their market demand to be able to adapt their activity, optimize their costs and maximize their profits in a given period of time.

2 Data source

The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. This data article describes two datasets with hotel demand data in Portugal. One of the hotels (H1) is a resort hotel that is in the region of Algarve and the other is a city hotel (H2) which is located in Lisbon. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.

3 Analysis plan

To answer our problematic question, it is important for us to explore the variables of the given dataset. Therefore, we are going to answer a number of sub-questions for a better understanding of our data, for instance:

In this way, we will be able to choose the most relevant variables that will serve as predictors for our forecasting demand model.

This project will be structured as follows:

First part of the project (R)

Second part of the project (Python)

4 Import and process data

4.1 Import data set

df <- read.csv(file="hbook.csv", skip=0, na="")
df

Columns specification

In this part we are going to check the columns specification and convert all the misidentified ones into their right types.

  • Check columns specification
str(df)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : chr  "0" "0" "0" "0" ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "7/1/15" "7/1/15" "7/2/15" "7/2/15" ...
  • Change columns type

From our data frame we can see that “agent” and “reservation_status_date” are misspecified. Therefore, we need to re-import our data set in order to define the correct type of the 2 variables

# Change the type of date column & agent as numeric
df<-read_csv(file='hbook.csv', skip=0, col_names = T, na="", cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_double(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "%m/%d/%y")
))

Create the arrival date variable

In our data frame the arrival date of guests is defined by 3 separate variables. So we need to create a new arrival date column that combines them and define its correct format format (as.date)

# Create a new variable to replace the names of months with numbers 
df1 = mutate(df, Months = case_when(arrival_date_month=="January"~01, arrival_date_month=="February"~02, arrival_date_month=="March"~03, arrival_date_month=="April"~04, arrival_date_month=="May"~05, arrival_date_month=="June"~06, arrival_date_month=="July"~07, arrival_date_month=="August"~08, arrival_date_month=="September"~09, arrival_date_month=="October"~10, arrival_date_month=="November"~11, arrival_date_month=="December"~12 ))

# Combine the columns and create the arrival date column 
df1$arrival_date <- paste(df1$arrival_date_year, df1$Months, df1$arrival_date_day_of_month)

# Convert as date the "arrival_date" variable 
df1$arrival_date <- as.Date(df1$arrival_date, "%Y%m%d")

4.2 Handling missing values

In order to clean our data, it is necessary to check all the missing values and handle them before we start our Exploratory Data Analysis (EDA).

Checking missing values

In our data frame “NAs” are represented by the value “NULL”. Thus, we need first to replace all of them with “” (empty space) so that we can compute the number of missing values.

#Replace all "NULL" values with "" 
nrow(df1[,1:31,33] == "NULL") 
## [1] 119390
df1[,1:31,33][df1[,1:31,33] == "NULL"] <- NA

#Check missing values 
colSums(is.na(df1))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                            488 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                          16340 
##                        company           days_in_waiting_list 
##                         112593                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0 
##                         Months                   arrival_date 
##                              0                              0
plot_missing(df1)

Handling missing values

To clean our data frame we are going: + to delete the variable company because we have more than 93% of missing values, + to remove the rows of the variable agent because we can’t replace missing agent IDs with another value, + to remove the rows of the variables country and children because their percentage is very low.

#Deleting the column "company"
df1_cl <- select (df1, -company)
                  
#Deleting the rows with missing values
df1_cl = df1_cl%>% filter(!is.na(agent), !is.na(country), !is.na(children))

#Check missing values again
plot_missing(df1_cl)

Our Data is now clean.

5 Exploratory data analysis (EDA)

5.1 Explain the variables

  • Hotel: include 2 type of Hotels (H1 = Resort Hotel or H2 = City Hotel)

  • is_canceled: Value indicating if the booking was canceled (1) or not (0)

  • lead_time: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.

  • arrival_date: Date on which guests arrived at destination (the variable we created).

  • arrival_date_week_number: Week number of year for arrival date.

  • stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.

  • stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel.

  • adults: Number of adult guests.

  • children: Number of children.

  • babies: Number of babies.

  • meal: Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal.

  • country: Country of origin. Categories are represented in the ISO 3155–3:2013 format (Alpha-3 code).

  • market_segment: Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”.

  • distribution_channel: Booking distribution channel (it can be any method or platform by which a hotel sells its rooms). The term “TA” means “Travel Agents” and “TO” means “Tour Operators”.

  • is_repeated_guest: Value indicating if the booking name was from a repeated guest (1) or not (0)

  • previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking.

  • previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking.

  • reserved_room_type: Code of room type reserved. Code is presented instead of designation for anonymity reasons.

  • assigned_room_type: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.

  • booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS.

  • deposit_type : Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.

  • agent ID: ID of the travel agency that made the booking.

  • company ID: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons.

  • days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer.

  • customer_type: Type of booking, assuming one of four categories: Contract- when the booking has an allotment or other type of contract associated to it ; Group- when the booking is associated to a group ; Transient- when the booking is not part of a group or contract, and is not associated to other transient booking ; Transient-party- when the booking is transient, but is associated to at least other transient booking.

  • adr: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights.

  • required_car_parking_spaces: Number of car parking spaces required by the customer.

  • total_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor).

  • reservation_status: Reservation last status, assuming one of three categories: Canceled– booking was canceled by the customer; Check-Out– customer has checked in but already departed; No-Show– customer did not check-in and did inform the hotel of the reason why.

  • reservation_status_date: Date at which the last status was set. This variable can be used in conjunction with the “reservation_status” to understand when was the booking canceled or when did the customer checked-out of the hotel.

Exclude some variables

To forecast the demand we will only use the validated bookings (not cancelled). Therefore, we are going to filter only the reservations that are not cancelled, and also, from our initial explanation of the variables, we can already exclude some of them that provide information on cancellations, and that we have considered not relevant enough to develop our forecasting demand model.

# Create a new data frame:
df2 <- df1_cl %>% filter(is_canceled ==0) %>% select(-is_canceled, -lead_time, -previous_cancellations, -previous_bookings_not_canceled, -reserved_room_type, -deposit_type, -days_in_waiting_list, -reservation_status, -reservation_status_date, -Months)

5.2 Hotels comparison

We are going to compare the 2 types of hotels (City hotels & Resort hotels) in order to choose the one with the higher number of observations.

# Plot a histogram
df2$hotel <- as.factor(df2$hotel) #to convert as a factor 

ggplot(df2, aes(x=df2$hotel)) +  
  geom_bar(width=0.2, color="orange", fill="orange") + 
  ggtitle("Hotels comparison")

Following the results illustrated on the graph above, we have considered that it is relevant to focus only on the city hotel. Therefore, the next command will help us to create a new data frame with only this value (city hotel)

# Create a new data frame of "city hotel
df_CH <- filter(df2, hotel == "City Hotel")

5.3 Market segments

What are the major market segments?

To understand the market segments of our data frame, we are going to plot a pie chart which shows the size of each segment as a percentage.

# Calculate the percentages 
df_market <- df_CH %>% 
  group_by(market_segment) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

# Plot the pie chart
ggplot(df_market, aes(x = "", y = perc, fill = market_segment)) +
  geom_col(width = 1) +
  geom_label(aes(label = labels),
             position = position_stack(vjust = 0.2),
             show.legend = FALSE) +
  coord_polar(theta = "y") + 
  ggtitle("Market segments comparison")

According to the pie chart above, out of the 7 market segments, the major ones are: + “Online TA” (Online Travel agencies) with a percentage of 52.47% + “Offline TA/TO” (Offline traditional travel agencies / Tour Operators) with a percentage of 20.71% + “Direct” (Direct bookings with the hotel done over phone calls or emails) with a percentage of 10.90% + “Groups” (customers that combine their stay with an event or other form of activity) with a percentage of 9.41%

5.4 Guests & their preferences

In this part, our goal is to have a better acknowledge of type of guests that the city hotel is receiving and their general preferences.

Where are the most guests coming from (Countries of origin)?

We are going to plot a world map (choropleth map) that displays the number of guests by country of origin.

  • Step 1: Create a new data frame where we calculate the total number of guests per country of origin ;

  • Step 2: We already downloaded a world shapefile in our home directory, and now we need to load it in R ;

  • Step 3: We need to fortify the world data, which means transform our geospatial object “world_spdf” (s4 class list) to a data frame using the tidy function of the broom library, so that we can plot it with “ggplot2”. We should also select our key id (alpha-3 code) to be able to merge it with the other data frame (“df_country”) ;

  • Step 4: Merge the 2 data frames using the id in common (alpha-3 code). We need also to delete the NAs ;

  • Step 5: Customize the choropleth map representing the number of guests per origin country.

# Create data frame 
df_CH$country<-as.factor(df_CH$country)
df_country <- df_CH %>% group_by(country) %>% summarise(Number_of_guests = sum(adults,na.rm = TRUE)) 

# Load the world shapefile 
world_spdf =readOGR( '/Users/khalil/Desktop/JUB/Tools/world_shape_file/TM_WORLD_BORDERS_SIMPL-0.3.shp')

# Fortify the world data & keep trace of the commune code (id)
world_spdf_fortified <- tidy(world_spdf, region = "ISO3")

# Merge the 2 data frames 
world_spdf_fortified1 = world_spdf_fortified %>%
  left_join(. , df_country, by=c("id"="country"))

# Delete the NAs
world_spdf_fortified1$Number_of_guests[ is.na(world_spdf_fortified1$Number_of_guests)] = 0.001

# Customize our world map
ggplot() +
  geom_polygon(data = world_spdf_fortified1, aes(fill = Number_of_guests, x = long, y = lat, group = group) , size=0, alpha=0.9) +
  theme_void() +
  scale_fill_viridis(trans = "log", breaks=c(1,50,100,1000,5000,10000,15000), name="Number of guests", guide = guide_legend( keyheight = unit(3, units = "mm"), keywidth=unit(12, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) ) +
  labs(
    title = "Number of guests per origin country",
  ) +
  theme(
    text = element_text(color = "#22211d"),
    plot.background = element_rect(fill = "#f5f5f2", color = NA),
    panel.background = element_rect(fill = "#f5f5f2", color = NA),
    legend.background = element_rect(fill = "#f5f5f2", color = NA),

    plot.title = element_text(size= 20, hjust=0.01, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")),
    legend.position = c(0.5, 0.09)
  ) +
  coord_map()

As we know the city hotel is located in Lisbon in Portugal. According to the map, the highest number of guests are coming from Portugal, France, Germany and UK, followed by Spain, Italy, Mexico and Brazil. The rest of the guests come from different countries in different continents (for example: China, Russia, Australia, Middle east, Morocco, Argentina, Colombia…)

Comparison between customer types

# Count customer types
df_customer <- df_CH %>% 
  group_by(customer_type) %>%
  count()

# Plot a Bar chart
ggplot(df_customer, aes(x=customer_type, y=n, fill = customer_type)) +    
  geom_bar(stat = "identity") + ggtitle("Customer types comparison") + 
  labs(y="Count", x = "Customer types")

According to the graph the majority of the guests are “Transient”, followed by “Transient-Party” which is less than half of the first. The number of the other 2 (“Group” & “Contract”) is very low. Therefore, we can say that the duration of stay is generally short.

What is the type of customers in terms of their price sensitivity?

Here, we want to understand the type of customers, whether price is the main criteria that influences their decision or not. So to answer this question we will make a comparison of the most assigned rooms and also a comparison of the prices of each room to see if the cheapest rooms are the most assigned ones.

  • A comparison between all the room types according to the number of reservations

In this part, we are going to plot a bar chart that represents the different types of the assigned rooms

# Create a data frame to count the number of reservations 
df_room <- df_CH %>% 
  group_by(assigned_room_type) %>% # Variable to be transformed
  count()

# Plotting
ggplot(df_room, aes(x=assigned_room_type, y=n, fill = assigned_room_type)) +    
  geom_bar(stat = "identity") + ggtitle("Room types comparison") + 
  labs(y="Number of reservations", x = "Room types")

According to the bar chart above the most occupied rooms are: A & D

  • How does the price per night of each room vary over the year?
# Create a new data frame 
df_price_per_room <- df_CH %>% group_by(arrival_date_month, assigned_room_type) %>% summarise(mean_price_per_room = mean(adr,na.rm = TRUE)) 

#Sorting the months (the months are not sorted in our data frame)
Month.ordered <- ordered(df_price_per_room$arrival_date_month, month.name)

# Plotting 
ggplot(df_price_per_room, aes(x=Month.ordered, y=mean_price_per_room, group = assigned_room_type)) + 
  geom_line(aes(color = assigned_room_type), size = .3) +
  geom_point(aes(color = assigned_room_type),size = 2, shape = 18) +
  labs(fill = "") +
  theme(legend.position="bottom") + ggtitle("Evolution of prices per each room type")

  • The most expensive rooms are respectively: G & F
  • The cheapest room is: K
  • Rooms with highest fluctuations are: E, F, G and K

As we have seen, the most assigned rooms are A and D which are in a medium range of prices. Thus, we can say that these guests do not consider the price as the main criterion to stay at the hotel or not. They also consider other characteristics such as comfort, quality of service…

Is the number of repeated guests important?

# Calculate the percentages 
df_CH$is_repeated_guest <- as_factor(df_CH$is_repeated_guest) #To improve the the representation of the graph
df_reap <- df_CH %>% 
  group_by(is_repeated_guest) %>%
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

# Plot the pie chart
ggplot(df_reap, aes(x = "", y = perc, fill = is_repeated_guest)) +
  geom_col(width = 1) +
  geom_label(aes(label = labels),
             position = position_stack(vjust = 0.2),
             show.legend = FALSE) +
  coord_polar(theta = "y") + 
  ggtitle("Repeated Guests")

The number of repeated guests is insignificant (3%)

What is the total number of adults, children and babies?

# Calculate the total number of each category 
df_category <- summarise(df_CH, Adults = sum(adults, na.rm = TRUE), Children = sum(children, na.rm = TRUE), Babies = sum(babies, na.rm = TRUE))

# Pivoting (to be able to plot all the categories together)
df_category <- pivot_longer(df_category, c(Adults,Children,Babies) ,names_to = "Individuals", values_to = "Number")
  
# Plotting 
ggplot(data=df_category, aes(x=Individuals, y=Number)) +
  geom_bar(stat = "identity", width = 0.4, fill = "#FF6666") + ggtitle("Total number of adults, children and babies")

As we can see the number of children and babies is insignificant compared to the number of adults. Therefore, the variable “adults” will be our dependent variable in our predictive model.

5.5 Evolution of the demand & seasonality.

How did the market demand fluctuate from 2015 to 2017?

# Create Data frame
df_Dem <- df1 %>% filter(hotel == "City Hotel") %>% group_by(hotel, arrival_date_year, Months) %>% summarise(Demand = sum(adults, na.rm = TRUE))

# Combine month and year columns
df_Dem$Date <- as.yearmon(paste(df_Dem$arrival_date_year, df_Dem$Months), "%Y %m")

# Plotting
ggplot(df_Dem, aes(x=Date, y=Demand, group = hotel)) + 
  geom_line(aes(color = "blue")) +
  geom_point(aes(color = "red")) +
  labs(fill = "") +
  theme(legend.position="bottom") + ggtitle("Evolution of market demand")

In “July 2015” the demand increased sharply until it reached a first peak of more than 6000 guests. Then, as we approach the winter period, the demand has dropped drastically until reaching its lowest peak.

From “January 2016”, the demand started to increase rapidly again until it reached a new peak around May, then after, it started to decrease during the next 2 months. But, from “July 2016”, an increase took place to reach its 3rd peak in “August” before starting to drop sharply again until “January 2017”.

Then, similarly to the previous year, demand began to increase rapidly from “January 2017”, until reaching in “May 2017” its highest peak (+8000 guests) in these 3 years and then begin to fall once again.

Which months do City hotel have the highest (and lowest) number of guests?

# Create a new data frame 
df_CH$arrival_date_year <- as.factor(df_CH$arrival_date_year) #convert as factor to plot each year in a line
df_demand_months <- df_CH %>% group_by(arrival_date_year, arrival_date_month) %>% summarise(Demand_per_year = sum(adults,na.rm = TRUE)) 

# Sorting the months 
Month.ordered3 <- ordered(df_demand_months$arrival_date_month, month.name)

# Plotting 
ggplot(df_demand_months, aes(x=Month.ordered3, y=Demand_per_year, group = arrival_date_year)) + 
  geom_line(aes(color = arrival_date_year), size = .3) +
  geom_point(aes(color = arrival_date_year),size = 2, shape = 18) +
  labs(fill = "") +
  theme(legend.position="bottom") + ggtitle("Evolution of the demand over 2015, 2016 and 2017")

Taking into consideration the 2016 line (complete year) we can see the the demand is seasonal. It starts to increase from January (the lowest point) until it reaches its first peak in May. After that, it starts to drop temporarily, before starting to increase again from June until it reaches its second peak in August, then starts to decrease again.

How does the mean price per night vary over the year?

# Create a new data frame where we calculate the mean price
df_price <- df_CH %>% group_by(hotel,arrival_date_month) %>% summarise(mean_price = mean(adr,na.rm = TRUE))

# Sorting the months 
Month.ordered1 <- ordered(df_price$arrival_date_month, month.name)

# Plotting 
ggplot(df_price, aes(x=Month.ordered1, y=mean_price, group=hotel)) + #We specify "hotel" as "group" to be able to connect the points, because the months are considered as different groups
  geom_line( color="#69b3a2", size=2, alpha=0.9, linetype=2) +
  theme_ipsum() +
  ggtitle("Evolution of the mean price")

The prices increase in the spring and summer months and decrease in the autumn and winter seasons. The 2 most important peaks are respectively recorded during May and August. So, we can say that the prices vary significantly depending on the season.

Therefore, we can say that the prices vary according to the season in the same way as the demand. In other words, the price fluctuations depend on the level of demand which also varies according to the season.

What is the average duration of stay of the guests per month?

# Calculate the total duration of stay in a new column
df_stay1 <- mutate(df_CH, Total_duration = stays_in_weekend_nights + stays_in_week_nights)

# Reshape our data frame to plot the box chart including 3 variables
df_stay2 <- melt(df_stay1,id.vars='arrival_date_month', measure.vars=c('stays_in_weekend_nights','stays_in_week_nights', 'Total_duration')) 

# Sorting months
Months <- ordered(df_stay2$arrival_date_month, month.name)

# Box plot
ggplot(df_stay2) +
      geom_boxplot(aes(x=Months, y=value, color=variable)) + ggtitle("Average duration of stay per month")

As it is shown in this box plot, the average total duration doesn’t exceed more than 5 days and it is approximately the same in every month, but the dispersion remains important.

The highest number of days of stay (more than 40 days) are respectively registered in the following months: September, March, July and January. And the shortest duration is reported in December.

5.6 Correlation.

Remove some variables

To evaluate the correlation, we need to exclude some variables: + Hotel: because we have chosen to focus only on one type of hotel. + Those that give us information about the date: arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month. + countries : categorical variable + those that we considered insignificant in the EDA part: Children, babies & repeated guests.

# Remove variables 
df_CH1 <- select (df_CH, -hotel, -arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month, -country, -children, -babies, -is_repeated_guest) 

Convert from categorical variables to numerical variables

As it is only possible to check the correlation of numerical variables, we considered it relevant to transform some categorical variables into numerical ones because of their theoretical impact in predicting the hotel market demand.

These variables are: market_segment ; assigned_room_type ; customer_type ; distribution_channel

# Convert to numerical variables by creating new columns 
df_CH2 <- df_CH1 %>% 
  mutate(market_segment_num = case_when(market_segment=="Aviation"~0.2, market_segment=="Complementary"~0.4, market_segment=="Corporate"~0.6, market_segment=="Direct"~0.8, market_segment=="Groups"~1, market_segment=="Offline TA/TO"~1.2, market_segment=="Online TA"~1.4, market_segment=="Undefined"~0)) %>% 
  mutate(assigned_room_type_num = case_when(assigned_room_type=="A"~0.2, assigned_room_type=="B"~0.4, assigned_room_type=="C"~0.6, assigned_room_type=="D"~0.8, assigned_room_type=="E"~1, assigned_room_type=="F"~1.2, assigned_room_type=="G"~1.4, assigned_room_type=="K"~1.6)) %>% 
  mutate(customer_type_num = case_when(customer_type=="Contract"~0.2, customer_type=="Group"~0.4, customer_type=="Transient"~0.6, customer_type=="Transient-Party"~0.8)) %>% 
  mutate(distribution_channel_num = case_when(distribution_channel=="Corporate"~0.2, distribution_channel=="Direct"~0.4, distribution_channel=="GDS"~0.6, distribution_channel=="TA/TO"~0.8, distribution_channel=="Undefined"~0))

# Remove the categorical variables
df_CH3 <- select(df_CH2, -market_segment, -assigned_room_type, -distribution_channel,-customer_type)

Correlation (heat map)

# Plot the heat map
plot_correlation(na.omit(df_CH3))
## 1 features with more than 20 categories ignored!
## arrival_date: 787 categories

Based on the heat map we can notice that the correlation between our dependent variable “adults” and the other variables is very low and that is why we are going to drop them. For this reason, we are going to continue focusing on, only, our dependent variable and try to forecast it by itself. And as long as we want to make a prediction using a time series method, the most appropriate model in our case is ARIMA. Therefore, in the next step, we are going to check the autocorrelation to analyse the correlation between the values of the dependent variable.

5.7 Exporting the final data frame as csv file.

To develop our predictive model of the hotel demand, we are going to apply ARIMA model in Python. Therefore, we need to export our final data frame to re import in Python again.

In our final data frame we need to sum the values of the demand by date.

# Create the data frame 
df_FINAL <- df_CH %>% group_by(arrival_date) %>% summarise(Demand = sum(adults))

# Export the final data frame as a csv file.
write.csv(df_FINAL, "/Users/khalil/Desktop/JUB/Tools/df_FINAL.csv")

6 FINAL CONCLUSION